word2vec is one way to build word embeddings; for details, see the papers Distributed Representations of Words and Phrases and their Compositionality and Efficient Estimation of Word Representations in Vector Space. We need vector representations of words to feed into a neural network for downstream work. This chapter builds word2vec with the skip-gram model (see Word2Vec Tutorial - The Skip-Gram Model).

In the skip-gram model, the task is carried out by training a neural network with a single hidden layer. What we care about, however, is not the network's output but the hidden layer's weight matrix: that weight matrix is the word vectors (the embedding matrix) we are after.

The task is: given the center word in a sentence, predict the words near it (its context). For a given word, we look at the words nearby and randomly select one; the network then outputs, for every word in the vocabulary, the probability that it appears near the selected word.
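As a small illustration of how (center word, context word) training pairs are drawn, here is a minimal sketch; the window size, the example sentence, and the helper name skipgram_pair are made up for illustration.

import random

def skipgram_pair(tokens, i, window=2):
    # candidate context positions within `window` words of the center word at index i
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    candidates = [j for j in range(lo, hi) if j != i]
    # randomly select one nearby word as the training target
    return tokens[i], tokens[random.choice(candidates)]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pair(sentence, 3))  # e.g. ('fox', 'brown') or ('fox', 'jumps')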
Softmax, Negative Sampling, and Noise Contrastive Estimation
Negative sampling makes certain assumptions about the number of noise samples to generate (k) and the distribution of noise samples (Q) in order to simplify computation; specifically, it assumes that kQ(w) = 1.
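To make that simplification concrete, here is a short derivation sketch in LaTeX notation (following the Ruder post cited in the related articles below), where $s_\theta(w, c)$ is the model's score for word $w$ in context $c$: NCE models the probability that a pair came from the data rather than from the $k$ noise samples, and setting $kQ(w) = 1$ collapses that probability into the sigmoid used by negative sampling.

P(D = 1 \mid w, c)
  = \frac{\exp\big(s_\theta(w, c)\big)}{\exp\big(s_\theta(w, c)\big) + k\,Q(w)}
  \;\overset{kQ(w)=1}{\longrightarrow}\;
  \frac{\exp\big(s_\theta(w, c)\big)}{\exp\big(s_\theta(w, c)\big) + 1}
  = \sigma\big(s_\theta(w, c)\big)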
Related articles: On word embeddings - Part 2: Approximating the Softmax (Sebastian Ruder) and Notes on Noise Contrastive Estimation and Negative Sampling.

In Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al. note that training the skip-gram model with negative sampling is faster than the more complex hierarchical softmax and gives better vector representations for frequent words. NCE, however, has a theoretical guarantee that negative sampling lacks: as the number of noise samples grows, it approximates the full softmax. Mnih and Teh (2012) showed that with 25 noise samples the performance is close to that of the regular softmax, while training is roughly 45 times faster. Because of this theoretical guarantee, this chapter implements word2vec with NCE.

Finally, note that sampling-based methods are only useful during training; at prediction time, the full softmax is still needed to obtain normalized probabilities.
import math
import tensorflow as tf

# VOCAB_SIZE, BATCH_SIZE, EMBED_SIZE, NUM_SAMPLED, LEARNING_RATE are hyperparameters
# assumed to be defined elsewhere in the chapter.

# Step 1: define the placeholders for input and output
with tf.name_scope("data"):
    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
    target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')

with tf.device('/cpu:0'):
    with tf.name_scope("embed"):
        # Step 2: define weights. In word2vec, it's actually the weights that we care about
        embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0),
                                   name='embed_matrix')

    # Step 3 + 4: define the inference + the loss function
    with tf.name_scope("loss"):
        # Step 3: define the inference
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

        # Step 4: construct variables for NCE loss
        nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                                     stddev=1.0 / math.sqrt(EMBED_SIZE)),
                                 name='nce_weight')
        nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

        # define loss function to be NCE loss function
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, biases=nce_bias,
                                             labels=target_words, inputs=embed,
                                             num_sampled=NUM_SAMPLED,
                                             num_classes=VOCAB_SIZE),
                              name='loss')

    # Step 5: define optimizer
    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
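For completeness, here is a minimal training-loop sketch for the graph above; batch_gen (a generator yielding NumPy batches of center-word and target-word indices with shapes [BATCH_SIZE] and [BATCH_SIZE, 1]) and NUM_TRAIN_STEPS are assumed names that are not defined in this chapter.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    total_loss = 0.0
    for step in range(NUM_TRAIN_STEPS):
        centers, targets = next(batch_gen)  # assumed batch generator
        loss_batch, _ = sess.run([loss, optimizer],
                                 feed_dict={center_words: centers, target_words: targets})
        total_loss += loss_batch
        if (step + 1) % 2000 == 0:
            print('Average loss at step {}: {:5.1f}'.format(step + 1, total_loss / 2000))
            total_loss = 0.0

# At prediction time the sampled loss is not used; normalized probabilities would come from
# a full softmax over tf.matmul(embed, nce_weight, transpose_b=True) + nce_bias.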
Object-Oriented Programming
To make the code more reusable, we organize the graph-building code with an object-oriented design.
class SkipGramModel:
    """ Build the graph for word2vec model """

    def __init__(self, params):
        pass

    def _create_placeholders(self):
        """ Step 1: define the placeholders for input and output """
        pass

    def _create_embedding(self):
        """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
        pass

    def _create_loss(self):
        """ Step 3 + 4: define the inference + the loss function """
        pass

    def _create_optimizer(self):
        """ Step 5: define optimizer """
        pass
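One possible way to fill in the skeleton, reusing the step-by-step code from above; treating params as a plain dict and adding a build_graph convenience method are assumptions made for illustration.

class SkipGramModel:
    """ Build the graph for word2vec model """
    def __init__(self, params):
        self.vocab_size = params['vocab_size']      # assumed dict-style params
        self.embed_size = params['embed_size']
        self.batch_size = params['batch_size']
        self.num_sampled = params['num_sampled']
        self.lr = params['learning_rate']

    def _create_placeholders(self):
        """ Step 1: define the placeholders for input and output """
        with tf.name_scope("data"):
            self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size],
                                               name='center_words')
            self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1],
                                               name='target_words')

    def _create_embedding(self):
        """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
        with tf.name_scope("embed"):
            self.embed_matrix = tf.Variable(
                tf.random_uniform([self.vocab_size, self.embed_size], -1.0, 1.0),
                name='embed_matrix')

    def _create_loss(self):
        """ Step 3 + 4: define the inference + the loss function """
        with tf.name_scope("loss"):
            embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed')
            nce_weight = tf.Variable(
                tf.truncated_normal([self.vocab_size, self.embed_size],
                                    stddev=1.0 / math.sqrt(self.embed_size)),
                name='nce_weight')
            nce_bias = tf.Variable(tf.zeros([self.vocab_size]), name='nce_bias')
            self.loss = tf.reduce_mean(
                tf.nn.nce_loss(weights=nce_weight, biases=nce_bias,
                               labels=self.target_words, inputs=embed,
                               num_sampled=self.num_sampled, num_classes=self.vocab_size),
                name='loss')

    def _create_optimizer(self):
        """ Step 5: define optimizer """
        self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss)

    def build_graph(self):
        """ Assemble the graph by calling the steps in order (hypothetical convenience method) """
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()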
t-SNE
t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects have a high probability of being picked, whilst dissimilar points have an extremely small probability of being picked. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. Note that whilst the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this should be changed as appropriate.
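t-SNE itself is not implemented in this chapter's code (TensorBoard's embedding projector, shown below, can run it interactively), but as an offline sketch the trained embedding matrix could be projected with scikit-learn; scikit-learn and matplotlib are assumed to be installed, and words is an assumed list of the 500 most frequent words in the same order as the embedding rows.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# final_embed_matrix: [VOCAB_SIZE, EMBED_SIZE] NumPy array from sess.run(model.embed_matrix)
low_dim = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(final_embed_matrix[:500])

plt.figure(figsize=(12, 12))
for (x, y), word in zip(low_dim, words[:500]):
    plt.scatter(x, y, s=4)
    plt.annotate(word, xy=(x, y), fontsize=7)
plt.savefig('tsne_skip_gram.png')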
from tensorflow.contrib.tensorboard.plugins import projector

# after training the word vectors, fetch the embed_matrix
final_embed_matrix = sess.run(model.embed_matrix)

# create a tf.Variable to hold the embeddings; it cannot be a constant,
# nor the embed_matrix defined earlier in the model.
# keep only the 500 most popular words
embedding_var = tf.Variable(final_embed_matrix[:500], name='embedding')
sess.run(embedding_var.initializer)

config = projector.ProjectorConfig()
summary_writer = tf.summary.FileWriter(LOGDIR)

# add the embedding to the config
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name

# link the embeddings to their metadata file. In this case, the file that contains
# the 500 most popular words in our vocabulary
embedding.metadata_path = LOGDIR + '/vocab_500.tsv'

# save a configuration file that TensorBoard will read during startup
projector.visualize_embeddings(summary_writer, config)

# save our embedding
saver_embed = tf.train.Saver([embedding_var])
saver_embed.save(sess, LOGDIR + '/skip-gram.ckpt', 1)
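The metadata file vocab_500.tsv is assumed to already exist in LOGDIR; as a sketch (with words again being an assumed frequency-sorted vocabulary list), it could be written like this so that row i of embedding_var matches line i of the file.

import os

# write one word per line, most frequent first
with open(os.path.join(LOGDIR, 'vocab_500.tsv'), 'w') as f:
    for word in words[:500]:
        f.write(word + '\n')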